Add MiniMax-M2.5 FP8 vLLM benchmark for B200 #757

Open
functionstackx wants to merge 19 commits into main from claude/issue-756-20260219-0235

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Feb 19, 2026

Add MiniMax-M2.5 FP8 vLLM benchmark configuration for B200 GPUs.

  • New benchmark script benchmarks/single_node/minimaxm2.5_fp8_b200.sh
  • Config entry minimaxm2.5-fp8-b200-vllm in nvidia-master.yaml
  • Image: vllm/vllm-openai:v0.16.0-cu130
  • TP=2 and TP=4, concurrency 4-64
  • Sequence lengths: 1k1k, 1k8k, 8k1k

Closes #756

Generated with Claude Code

github-actions bot and others added 2 commits February 19, 2026 13:46
- Add benchmark script benchmarks/minimaxm2.5_fp8_b200.sh
- Add config entry minimaxm2.5-fp8-b200-vllm to nvidia-master.yaml
  - Image: vllm/vllm-openai:v0.15.1-cu130
  - Model: MiniMaxAI/MiniMax-M2.5
  - TP=2 and TP=4, concurrency 4-64
  - Sequence lengths: 1k1k, 1k8k, 8k1k
- Update perf-changelog.yaml

Closes #756

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx functionstackx force-pushed the claude/issue-756-20260219-0235 branch from 34f7756 to cbc5d0b Compare February 19, 2026 18:46
@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 19, 2026
@kedarpotdar-nv
Collaborator

@claude please review. i made a few changes

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 20, 2026

Claude finished @kedarpotdar-nv's task in 2m 40s


Review of PR #757

  • Gather context and read changed files
  • Review benchmark script
  • Review nvidia-master.yaml config
  • Review perf-changelog.yaml
  • Post review feedback

Bug: Wrong benchmark_lib.sh source path

benchmarks/single_node/minimaxm2.5_fp8_b200.sh:3 sources benchmark_lib.sh with:

source "$(dirname "$0")/benchmark_lib.sh"

This resolves to benchmarks/single_node/benchmark_lib.sh, which does not exist. The file is at benchmarks/benchmark_lib.sh. Every other script in single_node/ uses /../:

source "$(dirname "$0")/../benchmark_lib.sh"

This will cause the script to fail immediately at runtime.
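The difference is easy to demonstrate in isolation. A sketch using a throwaway directory layout that mirrors the repo's `benchmarks/` tree (paths and the "lib loaded" marker are illustrative only):

```shell
#!/bin/bash
# Recreate the layout: lib at benchmarks/, script one level down in single_node/
demo=$(mktemp -d)
mkdir -p "$demo/benchmarks/single_node"
echo 'echo "lib loaded"' > "$demo/benchmarks/benchmark_lib.sh"

cat > "$demo/benchmarks/single_node/run.sh" <<'EOF'
#!/bin/bash
# $(dirname "$0") is the directory containing this script (single_node/),
# so reaching benchmarks/benchmark_lib.sh requires the /.. hop.
source "$(dirname "$0")/../benchmark_lib.sh"
EOF

bash "$demo/benchmarks/single_node/run.sh"
```

Without the `/..`, `source` would look for `single_node/benchmark_lib.sh` and the script would die at line 3 before serving anything.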



Benchmark Script (minimaxm2.5_fp8_b200.sh)

Good changes:

  • --enable-expert-parallel (line 32) is appropriate — MiniMax-M2.5 is a MoE model (456B total, ~47B active params). The MI355X version (minimaxm2.5_fp8_mi355x.sh) does NOT have this flag, so this is a deliberate addition for B200.
  • --trust-remote-code on both the vLLM serve and benchmark client side is correct for this model.
  • --block-size=32, --gpu-memory-utilization 0.95, and the overall script structure match existing patterns well.

No explicit FP8 quantization flags: The script doesn't pass --quantization fp8 or --kv-cache-dtype fp8. This is fine since MiniMax-M2.5 appears to ship with native FP8 weights, matching the approach in the MI355X version.


Config (nvidia-master.yaml)

The config entry at line 3018 looks correct:

  • runner: b200-nvs — matches the appropriate B200 runner
  • image: vllm/vllm-openai:v0.15.1-cu130 — new vLLM version, first config to use this tag
  • Search space (TP=2 and TP=4, conc 4-64) across all three seq-len configs is reasonable

No issues found here.
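For reference, the reviewed entry presumably looks something like this. A sketch only: the key, `runner`, and `image` values are quoted in the review; every other field name is an assumption about the repo's schema, and the exact concurrency sweep values are not shown.

```yaml
# Hypothetical shape — field names other than runner/image are assumed
minimaxm2.5-fp8-b200-vllm:
  runner: b200-nvs
  image: vllm/vllm-openai:v0.15.1-cu130
  model: MiniMaxAI/MiniMax-M2.5
  tp: [2, 4]
  concurrency: 4-64        # range as reviewed; exact sweep values not shown here
  seq_lens: [1k1k, 1k8k, 8k1k]
```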


Changelog (perf-changelog.yaml)

Content is correct. Minor: trailing whitespace on line 690 (pr-link: line) and trailing blank line at 692. Not a blocker.


Summary

One blocking issue: the benchmark_lib.sh source path bug will prevent the script from running. Everything else looks good — the --enable-expert-parallel addition and overall config are solid.

@functionstackx
Contributor Author

@claude implement this fix in this PR's runners/launch_b200-dgxc-slurm.sh too

[screenshot]

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 24, 2026

Claude finished @functionstackx's task in 1m 5s


Implementing fix in runners/launch_b200-dgxc-slurm.sh

  • Gather context and understand the fix from the screenshot
  • Implement the fix — change HF_HUB_CACHE_MOUNT path from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache
  • Commit and push changes

Updated HF_HUB_CACHE_MOUNT path in runners/launch_b200-dgxc-slurm.sh:216 from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache to match the correct mount point.


Change HF_HUB_CACHE_MOUNT from /home/sa-shared/gharunners/hf-hub-cache/
to /scratch/fsw/gharunners/hf-hub-cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@cquil11 cquil11 requested a review from a team February 25, 2026 19:06
@cquil11
Collaborator

cquil11 commented Feb 25, 2026

@functionstackx
Contributor Author

@ankursingh-nv i cancelled the workflow as it seems to be hard erroring out

@ankursingh-nv ankursingh-nv force-pushed the claude/issue-756-20260219-0235 branch from 66a55e0 to b56773c Compare March 4, 2026 23:06
@ankursingh-nv ankursingh-nv requested a review from jgangani as a code owner March 4, 2026 23:06
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--enable-expert-parallel \
Contributor Author

@ankursingh-nv can you set the EP setting in nvidia-master.yaml, and then in this script only pass --enable-expert-parallel if EP is set
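A minimal way to wire that up — a sketch, where the `EP_SIZE` variable name follows the env-var mentioned in the later PR #864 notes and is otherwise an assumption about how the runner exposes the config value:

```shell
#!/bin/bash
# Emit --enable-expert-parallel only when an EP size > 1 is configured.
# EP_SIZE is an assumed env-var name, populated from nvidia-master.yaml by the runner.
ep_flag() {
  if [ "${EP_SIZE:-1}" -gt 1 ]; then
    echo "--enable-expert-parallel"
  fi
}

EP_SIZE=2
echo "vllm serve ... $(ep_flag)"   # flag present
EP_SIZE=1
echo "vllm serve ... $(ep_flag)"   # flag omitted
```

This keeps the script usable for non-MoE sweeps from the same config file, since vLLM's flag is a boolean switch rather than an integer size.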

Contributor Author

@claude create a new PR from remote main that documents situations where vllm serve doesn't take --expert-parallel-size directly and uses --enable-expert-parallel instead, in CLAUDE.md, claude.yml, claude-review.yml

Collaborator

Klaud-Cold commented Mar 5, 2026

Claude finished @functionstackx's task in 2m 7s


Creating PR for --enable-expert-parallel documentation

  • Gather context: read CLAUDE.md, claude.yml, claude-pr-review.yml
  • Create new branch from remote main
  • Add documentation about --enable-expert-parallel vs --expert-parallel-size usage
  • Commit and push changes
  • Create PR

Created PR #864: Document expert parallelism flag differences in Claude configs

Changes made to claude.yml and claude-pr-review.yml:

  • vLLM uses --enable-expert-parallel (boolean flag) — does NOT accept --expert-parallel-size
  • SGLang uses --expert-parallel-size N (explicit integer)
  • ATOM (AMD vLLM fork) uses --enable-expert-parallel (same as vLLM)
  • Scripts must conditionally enable --enable-expert-parallel based on EP_SIZE env var, not hardcode it
  • Added PR review validation rule to flag hardcoded --enable-expert-parallel

Note: There is no CLAUDE.md file in this repo — the Claude instructions live in the workflow prompt: fields in claude.yml and claude-pr-review.yml.

Contributor Author

@ankursingh-nv @kedarpotdar-nv can you also link the vLLM recipes for MiniMax and Kimi K2.5 in the official NVIDIA docs too

https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html

Collaborator

Sure, let me address the EP setting.

@functionstackx
Contributor Author

@claude update PR description, seems like the image is now 0.16 and not 0.15.1

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 5, 2026

Claude finished @functionstackx's task in 32s


Updating PR Description

  • Check current image version in config
  • Update PR description

Updated the PR description to reflect the correct image version: vllm/vllm-openai:v0.16.0-cu130 (was v0.15.1-cu130).


Comment on lines +26 to +27
export VLLM_USE_FLASHINFER_MOE_FP8=0
export VLLM_MOE_USE_DEEP_GEMM=0
Contributor Author

document this too please in the vllm recipes

@functionstackx
Contributor Author

@ankursingh-nv is this PR ready? if so, can you ping @cquil11 for review



Development

Successfully merging this pull request may close these issues.

b200 MiniMax M2.5 FP8

5 participants